
Fix static generation when compiling! #28937

Merged: 42 commits merged into main on Feb 15, 2024

Conversation

@ArthurZucker (Collaborator) commented on Feb 9, 2024:

What does this PR do?

Fixes the static cache generation. Comes with #27931

thanks @OlivierDehaene for the insight

https://gist.github.com/ArthurZucker/ae0a86ef8f841c0ef69aaa52ccbc0b03 benchmark

  • fixes an issue with FlashAttention: when the cache is padded, the full attention mask is needed; otherwise generations from generate are wrong because the first forward pass is treated as fully causal.
  • fixes graph runs: the cache positions have to be stateless; otherwise the model ignores them and the compiled generations are random (a toy sketch of passing the positions as inputs follows this list).
  • fixes potential backward-compatibility (BC) breakage by guarding the use of cache positions.
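
A toy sketch of the stateless-positions idea (made-up function and shapes, not the actual modeling code): the write position is passed to every call as a tensor argument instead of being kept as Python-side state, so each compiled call writes to the correct slot of the pre-allocated cache.

import torch

@torch.compile
def write_step(cache, values, cache_position):
    # scatter the new values into the pre-allocated cache at the given positions
    cache[:, cache_position] = values
    return cache

cache = torch.zeros(1, 8)
for step in range(3):
    cache_position = torch.tensor([step])  # stateless: recomputed and passed in each call
    cache = write_step(cache, torch.full((1, 1), float(step + 1)), cache_position)
print(cache)  # tensor([[1., 2., 3., 0., 0., 0., 0., 0.]])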

Potential FA2 fix, if compile worked:

            # slice the states so the static KV cache is supported in FA2; not sure it's a must, as compile fails anyway
            if cache_position is not None:
                key_states = key_states[:, :, : cache_position[-1] + 1, :]
                value_states = value_states[:, :, : cache_position[-1] + 1, :]

but slicing gives me slowdowns compared to no slicing (benchmark screenshots attached in the PR).
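
For illustration, a minimal standalone sketch of what that slicing does (shapes are made up; this is not the actual FA2 code path): the static cache is pre-allocated to its maximum length, and the slice keeps only the entries written so far, up to the last cache position.

import torch

# Pre-allocated static cache: only the first 7 positions are filled so far.
batch, num_heads, max_cache_len, head_dim = 1, 8, 2048, 64
key_states = torch.zeros(batch, num_heads, max_cache_len, head_dim)
value_states = torch.zeros(batch, num_heads, max_cache_len, head_dim)

# cache_position holds the indices written in the current forward pass,
# e.g. 0..6 for a 7-token prompt.
cache_position = torch.arange(7)

# Drop the padded tail so attention only sees valid cache entries.
key_states = key_states[:, :, : cache_position[-1] + 1, :]
value_states = value_states[:, :, : cache_position[-1] + 1, :]
print(key_states.shape)  # torch.Size([1, 8, 7, 64])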

@ArthurZucker changed the title from "wow I was scared!" to "Fix static generation when compiling!" on Feb 9, 2024

@sanchit-gandhi (Contributor) commented on Feb 9, 2024:

I'm not sure adding a new argument cache_position to the forward call of the model is strictly backwards compatible. Here's an example to motivate this.

The following works on transformers==4.37.2:

import torch
from transformers import AutoModelForCausalLM, LlamaTokenizer

model = AutoModelForCausalLM.from_pretrained("trl-internal-testing/tiny-random-LlamaForCausalLM", attn_implementation="eager")
tokenizer = LlamaTokenizer.from_pretrained("trl-internal-testing/tiny-random-LlamaForCausalLM")

# random input id
inputs = tokenizer("Hey there", return_tensors="pt", return_attention_mask=True)

position_ids = inputs.attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(inputs.attention_mask == 0, 1)

with torch.no_grad():
    logits = model.forward(**inputs, position_ids=position_ids).logits

If we run the same code on this PR, we get the following error:

  File "/Users/sanchitgandhi/transformers/src/transformers/models/llama/modeling_llama.py", line 352, in forward
    attn_weights = attn_weights + causal_mask
                   ~~~~~~~~~~~~~^~~~~~~~~~~~~
RuntimeError: The size of tensor a (3) must match the size of tensor b (2048) at non-singleton dimension 4
Full traceback:
  File "/Users/sanchitgandhi/transformers/debug_llama.py", line 14, in <module>
    logits = model.forward(**inputs, position_ids=position_ids).logits
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sanchitgandhi/transformers/src/transformers/models/llama/modeling_llama.py", line 1106, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/Users/sanchitgandhi/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sanchitgandhi/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sanchitgandhi/transformers/src/transformers/models/llama/modeling_llama.py", line 950, in forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "/Users/sanchitgandhi/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sanchitgandhi/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sanchitgandhi/transformers/src/transformers/models/llama/modeling_llama.py", line 694, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
                                                          ^^^^^^^^^^^^^^^
  File "/Users/sanchitgandhi/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sanchitgandhi/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sanchitgandhi/transformers/src/transformers/models/llama/modeling_llama.py", line 352, in forward
    attn_weights = attn_weights + causal_mask
                   ~~~~~~~~~~~~~^~~~~~~~~~~~~
RuntimeError: The size of tensor a (3) must match the size of tensor b (2048) at non-singleton dimension 4

This is because cache_position is not specified in the forward call, and so defaults to None. When we do our slicing in the attention layer:

causal_mask = attention_mask[:, :, cache_position, : key_states.shape[-2]]
attn_weights = attn_weights + causal_mask

instead of indexing with [:, :, cache_position, : key_states.shape[-2]], we index with [:, :, None, : key_states.shape[-2]]. So instead of selecting the right rows, we insert an extra dimension! This gives the size mismatch when we add the attention mask to the weights. The user needs to pass cache_position to the forward call for this to work.
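
A minimal illustration of that failure mode (the shapes below are made up, mirroring the 2048-wide causal mask from the traceback):

import torch

attention_mask = torch.zeros(1, 1, 2048, 2048)  # pre-built 4D causal mask
cache_position = torch.arange(3)                # positions of the 3 prompt tokens

sliced = attention_mask[:, :, cache_position, :3]
print(sliced.shape)   # torch.Size([1, 1, 3, 3]) -> broadcasts against attn_weights

broken = attention_mask[:, :, None, :3]
print(broken.shape)   # torch.Size([1, 1, 1, 3, 2048]) -> extra dim, size mismatch

attn_weights = torch.zeros(1, 4, 3, 3)
print((attn_weights + sliced).shape)  # torch.Size([1, 4, 3, 3])
# attn_weights + broken  ->  RuntimeError at non-singleton dimension 4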

Overall, I think we should avoid adding extra arguments that require code changes from the user, especially to the top-level modules which are already heavily used. What about a design more like Flax, where we keep track of the cache_position internally in the StaticCache abstraction (see the sketch below)? That would then require no changes from the user.
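
A rough sketch of that suggested design, for discussion only (this is not the actual StaticCache API; names and shapes are invented): the cache counts the tokens it has already seen, so callers never need to pass cache_position.

import torch

class SelfTrackingStaticCache:
    def __init__(self, batch_size, num_heads, max_len, head_dim):
        self.keys = torch.zeros(batch_size, num_heads, max_len, head_dim)
        self.values = torch.zeros(batch_size, num_heads, max_len, head_dim)
        self.seen_tokens = 0  # internal counter replaces the cache_position argument

    def update(self, key_states, value_states):
        new_tokens = key_states.shape[-2]
        positions = torch.arange(self.seen_tokens, self.seen_tokens + new_tokens)
        self.keys[:, :, positions] = key_states
        self.values[:, :, positions] = value_states
        self.seen_tokens += new_tokens
        return self.keys[:, :, : self.seen_tokens], self.values[:, :, : self.seen_tokens]

cache = SelfTrackingStaticCache(1, 4, 16, 8)
k, v = cache.update(torch.randn(1, 4, 3, 8), torch.randn(1, 4, 3, 8))  # prefill 3 tokens
k, v = cache.update(torch.randn(1, 4, 1, 8), torch.randn(1, 4, 1, 8))  # one decode step
print(cache.seen_tokens, k.shape)  # 4 torch.Size([1, 4, 4, 8])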

@ArthurZucker (Collaborator, Author) replied:

We can make it BC! This PR is not ready yet, but generate should check the past key value class and, if the model's signature accepts cache_position, pass it. Something like that.

I'll work on making it BC! :)
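
A minimal sketch of what that signature check could look like (illustrative only; the helper name and dummy models are made up, this is not the actual generate code):

import inspect
import torch
from torch import nn

class OldStyleModel(nn.Module):
    def forward(self, input_ids, attention_mask=None):
        return input_ids

class NewStyleModel(nn.Module):
    def forward(self, input_ids, attention_mask=None, cache_position=None):
        return input_ids

def maybe_add_cache_position(model, model_kwargs, past_length, seq_length):
    # only pass cache_position if the model's forward actually declares it (keeps old models BC)
    if "cache_position" in inspect.signature(model.forward).parameters:
        model_kwargs["cache_position"] = torch.arange(past_length, past_length + seq_length)
    return model_kwargs

print(maybe_add_cache_position(OldStyleModel(), {}, 0, 3))  # {} -> nothing added
print(maybe_add_cache_position(NewStyleModel(), {}, 0, 3))  # {'cache_position': tensor([0, 1, 2])}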

Comment on lines 924 to 930
past_seen_tokens = 0
if use_cache and not isinstance(past_key_values, Cache):
    past_key_values = DynamicCache.from_legacy_cache(past_key_values)
    past_seen_tokens = past_key_values.get_usable_length(inputs_embeds.shape[1])  # kept for BC (cache positions)

if cache_position is None:
    cache_position = torch.arange(past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1])
ArthurZucker (Collaborator, Author):

Has to be kept for BC
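
For context, a small runnable sketch (with toy shapes) of what this BC path computes when a legacy tuple cache is passed in; the numbers are made up for illustration:

import torch
from transformers import DynamicCache

# Legacy cache format: one (key, value) pair per layer; here 2 layers with
# batch 1, 4 heads, 6 cached tokens, head_dim 8.
legacy_cache = tuple(
    (torch.zeros(1, 4, 6, 8), torch.zeros(1, 4, 6, 8)) for _ in range(2)
)
past_key_values = DynamicCache.from_legacy_cache(legacy_cache)
past_seen_tokens = past_key_values.get_usable_length(1)  # 6 tokens already in the cache
cache_position = torch.arange(past_seen_tokens, past_seen_tokens + 1)  # next token goes to slot 6
print(past_seen_tokens, cache_position)  # 6 tensor([6])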

Comment on lines 1054 to 1060
if attention_mask is None:
    return None
is_tracing = torch.jit.is_tracing() or isinstance(input_tensor, torch.fx.Proxy)
if not is_tracing and torch.all(attention_mask == 1):
    return None
if is_tracing and seq_length == 1:
    return None
ArthurZucker (Collaborator, Author):

All of this caused failed generations; will deal with it later.

ArthurZucker (Collaborator, Author):

cc @fxmarty, I am warning you in advance 🥶 you might have to do something similar for the prepared_4d_sdpa path, but this is a lot simpler, so all the better.

Comment on lines 1209 to 1213
# TODO @gante we should only keep a `cache_position` in generate, and do +=1.
# same goes for position ids. Could also help with continued generation.
cache_position = kwargs.get("cache_position", None)
if cache_position is None:
    cache_position = torch.arange(past_length, past_length + input_ids.shape[1])
ArthurZucker (Collaborator, Author):

Kept for BC as well; generate should handle cache positions IMO (see the sketch below for how they would evolve across steps).
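
A toy sketch of how the cache positions would evolve across generate steps under that scheme (values are illustrative, not taken from the actual generate code):

import torch

prompt_len = 5

# prefill: the whole prompt is written in one shot, at positions 0..4
cache_position = torch.arange(0, prompt_len)
print(cache_position)          # tensor([0, 1, 2, 3, 4])

# each subsequent decoding step writes exactly one new slot: just do += 1
for _ in range(3):
    cache_position = cache_position[-1:] + 1
    print(cache_position)      # tensor([5]), then tensor([6]), then tensor([7])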

@ArthurZucker marked this pull request as ready for review on February 12, 2024 at 08:08
@gante (Member) left a comment:

Pre-approving, as the overall PR shape looks good to me 👍

(btw, this PR is blocking further work on generate, as llama + generate + dynamic cache is not correct at the moment and I want to standardize the interface of the different cache classes to match the static cache)

@ArthurZucker (Collaborator, Author):

Thanks, merging asap

@ArthurZucker (Collaborator, Author):

Slow tests are happy (screenshot of the passing slow tests attached in the PR).

Comment on lines -4779 to +4781
- bool_keys = [k for k in keys if isinstance(model_input[k], bool)]
- non_bool_keys = [k for k in keys if not isinstance(model_input[k], bool) and not k == "encoder_outputs"]
+ bool_keys = [k for k in keys if isinstance(model_input[k], bool) or k == "cache_position"]
+ keys_to_ignore = ["cache_position", "encoder_outputs"]
+ non_bool_keys = [k for k in keys if not isinstance(model_input[k], bool) and k not in keys_to_ignore]
ArthurZucker (Collaborator, Author):

Beam search will split the cache positions otherwise (see the sketch below).
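
For illustration, a standalone sketch of why cache_position has to be excluded from the per-item split (the dict below is a made-up stand-in for the real model inputs, not the actual generate internals):

import torch

model_input = {
    "input_ids": torch.tensor([[1, 2, 3], [4, 5, 6]]),  # batch of 2 -> split per beam/item
    "use_cache": True,                                   # bool -> passed through as-is
    "cache_position": torch.arange(3),                   # shared positions -> must not be split
}
keys = list(model_input.keys())

bool_keys = [k for k in keys if isinstance(model_input[k], bool) or k == "cache_position"]
keys_to_ignore = ["cache_position", "encoder_outputs"]
non_bool_keys = [k for k in keys if not isinstance(model_input[k], bool) and k not in keys_to_ignore]

print(bool_keys)      # ['use_cache', 'cache_position'] -> kept whole
print(non_bool_keys)  # ['input_ids'] -> these get split along the batch dimension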

@younesbelkada (Contributor) left a comment:

Thanks for the huge work! I left some minor comments that should be addressed before merging IMO; otherwise we might introduce breaking changes for users who use our public classes without explicit positional arguments.

Review comments (resolved) were left on src/transformers/cache_utils.py and src/transformers/models/llama/modeling_llama.py.
@younesbelkada (Contributor):

Example of a breaking behaviour that I introduced while working on FA2: #25598 (comment). So we should be careful when adding new arguments to our modules.

@younesbelkada (Contributor) left a review:

Thank you very much!

@ArthurZucker merged commit f3788b0 into main on Feb 15, 2024
21 checks passed
@ArthurZucker deleted the fix-static-kv-cache branch on February 15, 2024 at 05:27
hackyon pushed a commit to hackyon/transformers that referenced this pull request Feb 15, 2024
* wow I was scared!
* fix everything
* nits
* make it BC?
* add todo
* nits
* is_tracing should still be used to pass tracing tests
* nits
* some nits to make sure generation works with static cache uncompiled
* fix sdpa
* fix FA2 for both static and dynamic in a better way?
* style
* fix-copies
* fix fix copies
* fix sequential beam search
* style
* use `keys_to_ignore`
* nit
* correct dtype inference when init
* :( the fix for FA2 is still not optimal, to investigate!
* styling
* nits
* nit
* this might work better
* add comment
* Update src/transformers/models/llama/modeling_llama.py
* "position_ids" -> "cache_position"
* style
* nit
* Remove changes that should not be propagated just yet
* Apply suggestions from code review
* Styling
* make sure we raise an error for static cache with FA2 enabled
* move to the bottom of the signature
* style
* Update src/transformers/models/llama/modeling_llama.py
* Update src/transformers/models/llama/modeling_llama.py
* nit in the name

Co-authored-by: Younes Belkada <[email protected]>
@ArthurZucker mentioned this pull request on Feb 20, 2024
@alanwaketan (Contributor):

Hey @ArthurZucker, I discovered that this change actually breaks TPU...

Now, TPU training with FSDPv2 produces NaN losses. I haven't looked into your PR, so I'm not sure why; I just bisected down to this change.

@ArthurZucker (Collaborator, Author):

Hmm, this might be a RoPE issue? #29109 might also play a role.

@learning-chip:

Hi @ArthurZucker, I ran your benchmark script with both transformers 4.38.0 and 4.38.2 but got this error:

Traceback (most recent call last):
  File "/home/best_benchmark.py", line 99, in <module>
    generated_ids[:, cache_position] = input_ids.to(device).to(torch.int)
RuntimeError: shape mismatch: value tensor of shape [1686] cannot be broadcast to indexing result of shape [1, 2048]

@ArthurZucker (Collaborator, Author):

It is probably out of date! I'll update it.

@ArthurZucker (Collaborator, Author):

We'll actually push a full benchmark in transformers to make sure we always track this!

itazap pushed a commit that referenced this pull request May 14, 2024
(same squashed commit message as above)